AITopics | text stream

text stream

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Dynamic Rank Factor Model for Text Streams

Neural Information Processing Systems | Dec-27-2025, 15:05:13 GMT

We propose a semi-parametric and dynamic rank factor model for topic modeling, capable of (1) discovering topic prevalence over time, and (2) learning contemporary multi-scale dependence structures, providing topic and word correlations as a byproduct. The high-dimensional and time-evolving ordinal/rank observations (such as word counts), after an arbitrary monotone transformation, are well accommodated through an underlying dynamic sparse factor model. The framework naturally admits heavy-tailed innovations, capable of inferring abrupt temporal jumps in the importance of topics. Posterior inference is performed through straightforward Gibbs sampling, based on the forward-filtering backward-sampling algorithm. Moreover, an efficient data subsampling scheme is leveraged to speed up inference on massive datasets. The modeling framework is illustrated on two real datasets: the US State of the Union Address and the JSTOR collection from Science.
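One way to read the phrase "after an arbitrary monotone transformation" is that only the ordering of the observations matters to a rank-based model. The sketch below is not the authors' model; it only illustrates, with made-up word counts and a hypothetical helper `dense_ranks`, that ranks are unchanged when the counts pass through a strictly increasing function such as `log1p`.

```python
import math


def dense_ranks(values):
    """Return the dense rank (0 = smallest) of each value; ties share a rank."""
    order = {v: r for r, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]


# Made-up word counts for one document (illustrative only).
counts = [3, 0, 7, 3, 12]

# Any strictly increasing transformation keeps the ordering intact.
transformed = [math.log1p(c) for c in counts]

print(dense_ranks(counts))
print(dense_ranks(transformed))
```

Both calls print the same list of ranks, which is why a model defined on ranks does not need to commit to a particular scale for the counts.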

dynamic rank factor model, name change, text stream, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.79)

WhisperKit: On-device Real-time ASR with Billion-Scale Transformers

Orhon, Atila, Okan, Arda, Durmus, Berkin, Nagengast, Zach, Pacheco, Eduardo

arXiv.org Artificial Intelligence | Jul-16-2025

Real-time Automatic Speech Recognition (ASR) is a fundamental building block for many commercial applications of ML, including live captioning, dictation, meeting transcriptions, and medical scribes. Accuracy and latency are the most important factors when companies select a system to deploy. We present WhisperKit, an optimized on-device inference system for real-time ASR that significantly outperforms leading cloud-based systems. We benchmark against server-side systems that deploy a diverse set of models, including a frontier model (OpenAI gpt-4o-transcribe), a proprietary model (Deepgram nova-3), and an open-source model (Fireworks large-v3-turbo). Our results show that WhisperKit matches the lowest latency at 0.46s while achieving the highest accuracy at 2.2% WER. The optimizations behind the WhisperKit system are described in detail in this paper.
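The accuracy figure above is a word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch that computes WER as word-level edit distance divided by the reference length. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; the benchmark itself may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.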

large language model, latency, machine learning, (23 more...)

arXiv.org Artificial Intelligence

2507.1086

Country: North America > United States > California (0.46)

Genre: Research Report > New Finding (0.86)

Industry: Information Technology (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)

Agent-based visualization of streaming text

Benson, Jordan Riley, Crist, David, Lafleur, Phil, Watson, Benjamin

arXiv.org Artificial Intelligence | Jul-15-2025

We present a visualization infrastructure that maps data elements to agents, which have behaviors parameterized by those elements. Dynamic visualizations emerge as the agents change position, alter appearance and respond to one another. Agents move to minimize the difference between displayed agent-to-agent distances and an input matrix of ideal distances. Our current application is visualization of streaming text. Each agent represents a significant word, visualizing it by displaying the word itself, centered in a circle sized by the frequency of word occurrence. We derive the ideal distance matrix from word co-occurrence, mapping higher co-occurrence to lower distance. To depict co-occurrence in its textual context, the ratio of intersection to circle area approximates the ratio of word co-occurrence to frequency. A networked backend process gathers articles from news feeds, blogs, Digg or Twitter, exploiting online search APIs to focus on user-chosen topics. Resulting visuals reveal the primary topics in text streams as clusters, with agent-based layout moving without instability as data streams change dynamically.
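The layout idea in this abstract (agents moving to reduce the gap between displayed and ideal distances) can be sketched as gradient descent on a squared-error "stress". Everything below is an assumption made for illustration: the mapping `1 / (1 + count)` from co-occurrence to ideal distance, the step size, and the function names are not taken from the paper.

```python
import math


def ideal_distances(cooc):
    """Map higher co-occurrence to lower ideal distance (assumed mapping: 1 / (1 + count))."""
    n = len(cooc)
    return [[0.0 if i == j else 1.0 / (1.0 + cooc[i][j]) for j in range(n)]
            for i in range(n)]


def stress(pos, ideal):
    """Sum of squared differences between displayed and ideal pairwise distances."""
    n = len(pos)
    return sum((math.dist(pos[i], pos[j]) - ideal[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def step(pos, ideal, rate=0.05):
    """Move every agent a small step that reduces its distance errors."""
    new_pos = []
    for i, (xi, yi) in enumerate(pos):
        dx = dy = 0.0
        for j, (xj, yj) in enumerate(pos):
            if i == j:
                continue
            d = math.dist((xi, yi), (xj, yj)) or 1e-9
            err = d - ideal[i][j]  # positive: too far apart, negative: too close
            dx += rate * err * (xj - xi) / d
            dy += rate * err * (yj - yi) / d
        new_pos.append((xi + dx, yi + dy))
    return new_pos


# Toy symmetric co-occurrence counts for three words (made up).
cooc = [[0, 5, 0],
        [5, 0, 1],
        [0, 1, 0]]
ideal = ideal_distances(cooc)
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

initial = stress(positions, ideal)
for _ in range(200):
    positions = step(positions, ideal)
final = stress(positions, ideal)
print(initial, final)
```

In a streaming setting, the co-occurrence counts would change over time and the same update would simply keep running against the latest ideal matrix.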

artificial intelligence, information management, visualization, (19 more...)

arXiv.org Artificial Intelligence

2507.08884

Country: North America > United States (0.15)

Genre:

Instructional Material > Online (0.62)
Instructional Material > Course Syllabus & Notes (0.62)

Industry: Media > News (0.35)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

RiverText: A Python Library for Training and Evaluating Incremental Word Embeddings from Text Data Streams

Iturra-Bocaz, Gabriel, Bravo-Marquez, Felipe

arXiv.org Artificial Intelligence | Jul-1-2025

Word embeddings have become essential components in various information retrieval and natural language processing tasks, such as ranking, document classification, and question answering. However, despite their widespread use, traditional word embedding models present a limitation in their static nature, which hampers their ability to adapt to the constantly evolving language patterns that emerge in sources such as social media and the web (e.g., new hashtags or brand names). To overcome this problem, incremental word embedding algorithms are introduced, capable of dynamically updating word representations in response to new language patterns and processing continuous data streams. This paper presents RiverText, a Python library for training and evaluating incremental word embeddings from text data streams. Our tool is a resource for the information retrieval and natural language processing communities that work with word embeddings in streaming scenarios, such as analyzing social media. The library implements different incremental word embedding techniques, such as Skip-gram, Continuous Bag of Words, and Word Context Matrix, in a standardized framework. In addition, it uses PyTorch as its backend for neural network training. We have implemented a module that adapts existing intrinsic static word embedding evaluation tasks for word similarity and word categorization to a streaming setting. Finally, we compare the implemented methods with different hyperparameter settings and discuss the results. Our open-source library is available at https://github.com/dccuchile/rivertext.
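To make the idea of an incrementally updated word-context matrix concrete, here is a minimal standard-library sketch. It does not use or reproduce the RiverText API; the class `IncrementalWordContext` and its methods `learn_one` and `similarity` are hypothetical names chosen for this illustration, and raw co-occurrence counts stand in for whatever weighting a real implementation applies.

```python
import math
from collections import Counter, defaultdict


class IncrementalWordContext:
    """Toy word-context matrix that is updated one document at a time."""

    def __init__(self, window=2):
        self.window = window
        # counts[word][context_word] = number of co-occurrences seen so far
        self.counts = defaultdict(Counter)

    def learn_one(self, text):
        """Update co-occurrence counts from a single incoming document."""
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - self.window)
            hi = min(len(tokens), i + self.window + 1)
            for j in range(lo, hi):
                if j != i:
                    self.counts[word][tokens[j]] += 1

    def similarity(self, a, b):
        """Cosine similarity between the context-count rows of two words."""
        va, vb = self.counts.get(a), self.counts.get(b)
        if not va or not vb:
            return 0.0
        dot = sum(va[c] * vb[c] for c in va if c in vb)
        norm_a = math.sqrt(sum(v * v for v in va.values()))
        norm_b = math.sqrt(sum(v * v for v in vb.values()))
        return dot / (norm_a * norm_b)


model = IncrementalWordContext(window=2)
for doc in ["new hashtag trends today", "new brand trends today"]:
    model.learn_one(doc)  # the model never sees the stream as a batch

print(model.similarity("hashtag", "brand"))
```

Words that first appear late in the stream (a new hashtag, for instance) get a row as soon as they are observed, which is the property static embeddings lack.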

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3539618.3591908

2506.23192

Country:

Asia > Taiwan > Taiwan Province > Taipei (0.05)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Oceania > New Zealand (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

WorryWords: Norms of Anxiety Association for over 44k English Words

Mohammad, Saif M.

arXiv.org Artificial Intelligence | Nov-6-2024

Anxiety, the anticipatory unease about a potential negative outcome, is a common and beneficial human emotion. However, there is still much that is not known, such as how anxiety relates to our body and how it manifests in language. This is especially pertinent given the increasing impact of anxiety-related disorders. In this work, we introduce WorryWords, the first large-scale repository of manually derived word–anxiety associations for over 44,450 English words. We show that the anxiety associations are highly reliable. We use WorryWords to study the relationship between anxiety and other emotion constructs, as well as the rate at which children acquire anxiety words with age. Finally, we show that using WorryWords alone, one can accurately track the change of anxiety in streams of text. The lexicon enables a wide variety of anxiety-related research in psychology, NLP, public health, and social sciences. WorryWords (and its translations to over 100 languages) is freely available. http://saifmohammad.com/worrywords.html
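Tracking an emotion in a text stream with a word-association lexicon can be sketched as averaging per-word scores per document and then smoothing over recent documents. The lexicon entries and scores below are invented for this example and are not taken from WorryWords; the function names and the rolling-window smoothing are likewise assumptions.

```python
from collections import deque

# Invented toy scores (higher = more anxious); NOT actual WorryWords entries.
TOY_LEXICON = {"panic": 3, "deadline": 2, "exam": 2, "relax": -2, "calm": -3}


def document_score(text, lexicon):
    """Mean lexicon score of the words in `text` that the lexicon covers."""
    scores = [lexicon[t] for t in text.lower().split() if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


def track(stream, lexicon, window=2):
    """Rolling mean of document scores over the last `window` documents."""
    recent = deque(maxlen=window)
    trend = []
    for text in stream:
        recent.append(document_score(text, lexicon))
        trend.append(sum(recent) / len(recent))
    return trend


stream = ["calm relax", "exam tomorrow", "panic deadline"]
print(track(stream, TOY_LEXICON, window=2))
```

Words missing from the lexicon are ignored rather than scored as neutral, so coverage of the lexicon directly affects how stable the trend is.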

correlation, emotion, worryword, (17 more...)

arXiv.org Artificial Intelligence

2411.03966

Country:

Asia > India (0.04)
Europe > Middle East > Malta (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(11 more...)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.68)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.93)
Information Technology > Communications > Social Media > Crowdsourcing (0.68)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.48)

Evolving Text Data Stream Mining

Kumar, Jay

arXiv.org Artificial Intelligence | Aug-15-2024

A text stream is an ordered sequence of text documents generated over time. A massive amount of such text data is generated by online social platforms every day. Designing an algorithm to extract useful information from such text streams is a challenging task due to unique properties of the stream such as infinite length, data sparsity, and evolution. Therefore, learning useful information from such streaming data under the constraint of limited time and memory has gained increasing attention. Although many text stream mining algorithms have been proposed during the past decade, some potential issues still exist. First, high-dimensional text data heavily degrades the learning performance until the model either works on a subspace or reduces the global feature space. The second issue is to extract semantic text representation of documents and capture evolving topics over time. Moreover, the problem of label scarcity exists, whereas existing approaches assume the full availability of labeled data. To deal with these issues, in this thesis, new learning models are proposed for clustering and multi-label learning on text streams.

dataset, electronic science and technology, text stream, (14 more...)

arXiv.org Artificial Intelligence

2409.0001

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.13)
(42 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)
Research Report > Experimental Study (0.92)
Research Report > Promising Solution (0.67)

Industry:

Information Technology (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
(4 more...)

Write Summary Step-by-Step: A Pilot Study of Stepwise Summarization

Chen, Xiuying, Gao, Shen, Li, Mingzhe, Zhu, Qingqing, Gao, Xin, Zhang, Xiangliang

arXiv.org Artificial Intelligence | Jun-8-2024

Nowadays, neural text generation has made tremendous progress in abstractive summarization tasks. However, most of the existing summarization models take in the whole document all at once, which sometimes cannot meet the needs in practice. Practically, social text streams such as news events and tweets keep growing from time to time, and can only be fed to the summarization system step by step. Hence, in this paper, we propose the task of Stepwise Summarization, which aims to generate a new appended summary each time a new document is proposed. The appended summary should not only summarize the newly added content but also be coherent with the previous summary, to form an up-to-date complete summary. To tackle this challenge, we design an adversarial learning model, named Stepwise Summary Generator (SSG). First, SSG selectively processes the new document under the guidance of the previous summary, obtaining polished document representation. Next, SSG generates the summary considering both the previous summary and the document. Finally, a convolutional-based discriminator is employed to determine whether the newly generated summary is coherent with the previous summary. For the experiment, we extend the traditional two-step update summarization setting to a multi-step stepwise setting, and re-propose a large-scale stepwise summarization dataset based on a public story generation dataset. Extensive experiments on this dataset show that SSG achieves state-of-the-art performance in terms of both automatic metrics and human evaluations. Ablation studies demonstrate the effectiveness of each module in our framework. We also discuss the benefits and limitations of recent large language models on this task.

dataset, information, summarization, (14 more...)

arXiv.org Artificial Intelligence

2406.05361

Country:

Asia > China (0.04)
Asia > Middle East > Saudi Arabia (0.04)

Genre: Research Report (1.00)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)

Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams

Garcia, Cristiano Mesquita, Koerich, Alessandro Lameiras, Britto, Alceu de Souza Jr, Barddal, Jean Paul

arXiv.org Artificial Intelligence | Mar-18-2024

The proliferation of textual data on the Internet presents a unique opportunity for institutions and companies to monitor public opinion about their services and products. Given the rapid generation of such data, the text stream mining setting, which handles sequentially arriving, potentially infinite text streams, is often more suitable than traditional batch learning. While pre-trained language models are commonly employed for their high-quality text vectorization capabilities in streaming contexts, they face challenges adapting to concept drift, the phenomenon where the data distribution changes over time, adversely affecting model performance. Addressing the issue of concept drift, this study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models, thereby mitigating performance degradation. We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions. Our evaluation, focused on Macro F1-score and elapsed time, employs two text stream datasets and an incremental SVM classifier to benchmark performance. Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification, demonstrating that larger sample sizes generally correlate with improved macro F1-scores. Notably, our proposed WordPieceToken ratio sampling method significantly enhances performance with the identified loss functions, surpassing baseline results.
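Since the evaluation above centers on the Macro F1-score, a short reminder of how it is computed may help: F1 is calculated per class and then averaged without weighting by class frequency. The sketch below is a plain-Python illustration with made-up labels; the function name `macro_f1` is an assumption, and the study may well have used a library implementation instead.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels: per-class F1 is 2/3, 4/5, and 1, so the macro average is 37/45.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
print(macro_f1(y_true, y_pred))
```

Because every class contributes equally, a classifier that ignores a rare class is penalized more under macro averaging than under plain accuracy.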

dataset, loss function, sample size, (14 more...)

arXiv.org Artificial Intelligence

2403.15455

Country:

North America > United States > New York (0.04)
South America > Brazil > Santa Catarina (0.04)
South America > Brazil > Paraná > Curitiba (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.48)

Methods for Generating Drift in Text Streams

Garcia, Cristiano Mesquita, Koerich, Alessandro Lameiras, Britto, Alceu de Souza Jr, Barddal, Jean Paul

arXiv.org Artificial Intelligence | Mar-18-2024

Systems and individuals produce data continuously. On the Internet, people share their knowledge, sentiments, and opinions, provide reviews about services and products, and so on. Automatically learning from these textual data can provide insights to organizations and institutions, thus preventing financial impacts, for example. To learn from textual data over time, the machine learning system must account for concept drift. Concept drift is a frequent phenomenon in real-world datasets and corresponds to changes in data distribution over time. For instance, a concept drift occurs when sentiments change or a word's meaning is adjusted over time. Although concept drift is frequent in real-world applications, benchmark datasets with labeled drifts are rare in the literature. To bridge this gap, this paper provides four textual drift generation methods to ease the production of datasets with labeled drifts. These methods were applied to Yelp and Airbnb datasets and tested using incremental classifiers respecting the stream mining paradigm to evaluate their ability to recover from the drifts. Results show that all methods have their performance degraded right after the drifts, and the incremental SVM is the fastest to run and recover the previous performance levels regarding accuracy and Macro F1-Score.
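To illustrate what a dataset with a labeled drift can look like, here is a minimal sketch that injects an abrupt drift into a labeled text stream by exchanging two class labels from a chosen position onward and recording where the drift applies. This is a hypothetical generator written for illustration and is not claimed to match any of the paper's four methods; the function name and the toy reviews are assumptions.

```python
def swap_labels_after(stream, drift_at, label_a, label_b):
    """Yield (text, label, drifted) triples.

    From index `drift_at` onward, `label_a` and `label_b` are exchanged,
    and `drifted` marks every instance that lies after the drift point.
    """
    for i, (text, label) in enumerate(stream):
        drifted = i >= drift_at
        if drifted and label in (label_a, label_b):
            label = label_b if label == label_a else label_a
        yield text, label, drifted


# Toy review stream (made up).
reviews = [("good food", "pos"), ("bad service", "neg"), ("great stay", "pos")]
drifted_stream = list(swap_labels_after(reviews, 1, "pos", "neg"))
print(drifted_stream)
```

Keeping the `drifted` flag next to each instance is what makes the drift "labeled": an evaluation can then measure how quickly an incremental classifier recovers after the known change point.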

arf, classifier, dataset, (15 more...)

arXiv.org Artificial Intelligence

2403.12328

Country:

North America > United States > New York (0.04)
South America > Brazil > Santa Catarina (0.04)
South America > Brazil > Paraná > Curitiba (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)